
Overview

This directory provides two primary pathways for deploying and running Large Language Models (LLMs) on the NYU Torch cluster: Hugging Face Transformers (for research/experimentation) and vLLM (for high-performance serving).

1. Basic Inference (Hugging Face)

This method is ideal for feature extraction, embeddings, or small-scale batch processing. Persistence is achieved by installing the environment into an ext3 overlay image that is mounted into an Apptainer container.

Workflow:
- Environment: Launch an Apptainer container with a read/write overlay.
- Persistence: Install conda and libraries directly into /ext3.
- Execution: Use AutoModel to load weights and perform a forward pass (see the sketch below).
- Key File: huggingface.py
- Ideal for: Extracting last_hidden_state (embeddings) or sentiment classification.
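As a rough illustration of the Execution step, the sketch below loads an encoder with AutoModel and reads last_hidden_state from a single forward pass. The model name is an assumption for illustration; it is not necessarily what huggingface.py uses.

```python
# Hedged sketch of the forward pass described above, assuming an
# encoder-style model; the model name is an illustrative choice.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # assumption, not necessarily what huggingface.py uses
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

inputs = tokenizer("A test sentence for embedding extraction.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, hidden_size):
# one embedding per input token.
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```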

2. High-Performance Serving (vLLM)

vLLM is the recommended tool for production-level throughput and low-latency inference. It uses PagedAttention to manage KV-cache memory efficiently. See our guide on deploying LLMs with vLLM on Torch.

Why vLLM?
- Speed: Higher throughput than standard inference backends on Torch.
- Compatibility: The server exposes an OpenAI-compatible API, so existing OpenAI clients work as a drop-in replacement (see the sketch below).
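To illustrate the drop-in compatibility, the sketch below queries a running vLLM server with the standard OpenAI Python client. It assumes a server was already started, e.g. with `vllm serve <model>` on the default port 8000; the model name and port are assumptions for illustration.

```python
# Hedged sketch: assumes a vLLM server is already running on this node,
# e.g. via `vllm serve meta-llama/Llama-3.1-8B-Instruct` (default port 8000).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM accepts any key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Hello from the Torch cluster!"}],
)
print(response.choices[0].message.content)
```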

Deployment Options:
- Online: Use vllm serve to start an HTTP server accessible via curl or OpenAI clients.
- Offline: Use the LLM class within Python for processing large datasets without a server (see the sketch below).
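A minimal sketch of the offline path with vLLM's LLM class is shown below; the model name and sampling settings are illustrative assumptions.

```python
# Hedged sketch of offline batch inference with vLLM's LLM class;
# model name and sampling parameters are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = [
    "Summarize what an ext3 overlay is in one sentence.",
    "Explain PagedAttention in one sentence.",
]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```

Because no HTTP server is involved, this pattern fits batch jobs submitted through the scheduler, where all prompts are known up front.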